04 Spark SQL essentials


In [ ]:
# Make it Python2 & Python3 compatible
from __future__ import print_function

SQL context

The Spark kernel provided in the Notebook automatically creates an SQLContext object named sqlContext (also available under the sqlCtx alias), just as it does for the SparkContext object in sc. Let's take a look at it:


In [ ]:
?sqlContext
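
As a quick sanity check, we can confirm that the sqlCtx alias points to the same object (this assumes the kernel registered both names, as described above):


In [ ]:
# Both names should refer to the same SQLContext instance
# (assumes the kernel registered the sqlCtx alias, as described above)
print(sqlContext is sqlCtx)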

We can inspect some of the SQLContext properties:


In [ ]:
print(dir(sqlContext))

The first action executed on the sqlContext in a Spark session will take extra time, since Spark uses that moment to initialize all the Spark SQL scaffolding.


In [ ]:
# Create a tiny dataset
sqlContext.range(1, 7, 2).collect()
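
The call returns a single-column DataFrame of Rows; collecting it yields [Row(id=1), Row(id=3), Row(id=5)], since range follows Python semantics with an exclusive upper bound.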

Spark session

In Spark $\ge 2$ there is an additional class: SparkSession, which is meant to supersede SQLContext, providing the same functionality plus additional capabilities.

The Notebook kernel also pre-creates a SparkSession object, placing it in the environment under the name spark.


In [ ]:
if sc.version.startswith('2'):
    print( spark.version )
    print( type(spark) )
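
When no SparkSession is pre-created (e.g. in a standalone script), one would typically be built by hand through the builder API; here is a minimal sketch, assuming a working pyspark installation (the appName value is arbitrary):


In [ ]:
# Minimal sketch: build a SparkSession by hand when the environment
# does not provide one. getOrCreate() reuses an existing active
# session, so running this in the Notebook is harmless.
from pyspark.sql import SparkSession

spark2 = SparkSession.builder \
    .appName('sql-essentials') \
    .getOrCreate()
print(spark2.version)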

DataFrame

A simple test: create a DataFrame from a list of tuples, naming the columns.


In [ ]:
# Create a DataFrame from a list of tuples; Spark infers the column types
data = [('Alice', 44), ('Bob', 32), ('Charlie', 62)]
df = spark.createDataFrame(data, schema=('name', 'age'))
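
Passing just the column names lets Spark infer the types. For full control, an explicit schema can be supplied instead; a sketch using the standard pyspark.sql.types API:


In [ ]:
# Sketch: the same DataFrame with an explicit schema instead of
# letting Spark infer the column types
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

explicit_schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=True),
])
df2 = spark.createDataFrame(data, schema=explicit_schema)
df2.printSchema()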

In [ ]:
df.toPandas()
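
Once created, the DataFrame supports the usual transformations and actions; a couple of quick examples:


In [ ]:
# A few basic DataFrame operations on the toy dataset
df.printSchema()                  # column names and inferred types
df.filter(df.age > 40).show()     # rows with age above 40
df.select('name').show()          # projection onto a single column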
